In many countries, the price of a house is determined mainly by its size. For example, within the same community, the sale price is calculated from the house area and the price per square meter: no matter how many bedrooms each has, two houses of the same size sell for the same price. In this case, there is a linear relationship between housing price and house size. In a city with high population density such as Hong Kong, housing prices are incredibly high, and the only housing option for the majority is an apartment.
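The size-only pricing rule above can be sketched in a few lines of R (the figures are made up for illustration):

```r
# Toy illustration (hypothetical numbers): price depends only on area.
area <- c(50, 80, 80)   # square meters; houses 2 and 3 differ only in bedrooms
rate <- 120000          # price per square meter (hypothetical)
price <- area * rate
price[2] == price[3]    # same size -> same sale price, bedrooms ignored
```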
Fig 1. Hong Kong Night View
However, in the United States, housing price is not determined by a single variable; many explanatory variables contribute to the housing price. People may live in a house with its own garage and pool.
Fig 2. American Housing
Now, let's take a look at a Kaggle competition: "House Prices - Advanced Regression Techniques".
For this project, we will work with the files "train.csv" and "test.csv", found in /data/house_prices. The files are from Kaggle: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data.
library(recipes)
library(workflows)
library(dplyr)
library(parsnip)
library(tidymodels)
library(ISLR)
library(ISLR2)
library(tidyverse)
library(glmnet)
tidymodels_prefer()
library(ggplot2)
library(readr)
library(gplots)
library(repr)
library(caret)
library(rpart.plot)
library(vip)
library(discrim)
library(poissonreg)
library(corrr)
library(klaR) # for naive bayes
library(janitor)
library(rsample)
After reading the CSV files, we use the janitor package to clean the variable names. Then we check for missing data and familiarize ourselves with the variables in the dataset.
train_data <- read_csv("~/Downloads/house_prices/train.csv") %>%
clean_names()
test_data <- read_csv("~/Downloads/house_prices/test.csv") %>%
clean_names()
missing_row <- train_data[!complete.cases(train_data),]
head(missing_row)
## # A tibble: 6 × 81
## id ms_sub_class ms_zoning lot_frontage lot_area street alley lot_shape
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## # … with 73 more variables: land_contour <chr>, utilities <chr>,
## # lot_config <chr>, land_slope <chr>, neighborhood <chr>, condition1 <chr>,
## # condition2 <chr>, bldg_type <chr>, house_style <chr>, overall_qual <dbl>,
## # overall_cond <dbl>, year_built <dbl>, year_remod_add <dbl>,
## # roof_style <chr>, roof_matl <chr>, exterior1st <chr>, exterior2nd <chr>,
## # mas_vnr_type <chr>, mas_vnr_area <dbl>, exter_qual <chr>, exter_cond <chr>,
## # foundation <chr>, bsmt_qual <chr>, bsmt_cond <chr>, bsmt_exposure <chr>, …
variables <- names(train_data)
variables
## [1] "id" "ms_sub_class" "ms_zoning" "lot_frontage"
## [5] "lot_area" "street" "alley" "lot_shape"
## [9] "land_contour" "utilities" "lot_config" "land_slope"
## [13] "neighborhood" "condition1" "condition2" "bldg_type"
## [17] "house_style" "overall_qual" "overall_cond" "year_built"
## [21] "year_remod_add" "roof_style" "roof_matl" "exterior1st"
## [25] "exterior2nd" "mas_vnr_type" "mas_vnr_area" "exter_qual"
## [29] "exter_cond" "foundation" "bsmt_qual" "bsmt_cond"
## [33] "bsmt_exposure" "bsmt_fin_type1" "bsmt_fin_sf1" "bsmt_fin_type2"
## [37] "bsmt_fin_sf2" "bsmt_unf_sf" "total_bsmt_sf" "heating"
## [41] "heating_qc" "central_air" "electrical" "x1st_flr_sf"
## [45] "x2nd_flr_sf" "low_qual_fin_sf" "gr_liv_area" "bsmt_full_bath"
## [49] "bsmt_half_bath" "full_bath" "half_bath" "bedroom_abv_gr"
## [53] "kitchen_abv_gr" "kitchen_qual" "tot_rms_abv_grd" "functional"
## [57] "fireplaces" "fireplace_qu" "garage_type" "garage_yr_blt"
## [61] "garage_finish" "garage_cars" "garage_area" "garage_qual"
## [65] "garage_cond" "paved_drive" "wood_deck_sf" "open_porch_sf"
## [69] "enclosed_porch" "x3ssn_porch" "screen_porch" "pool_area"
## [73] "pool_qc" "fence" "misc_feature" "misc_val"
## [77] "mo_sold" "yr_sold" "sale_type" "sale_condition"
## [81] "sale_price"
There are 81 different variables in the dataset. Since simple linear regression uses only numerical data, all the non-numerical variables are out of consideration here. After selecting numerical variables from the original dataset, we are left with 13 predictors and 1 outcome variable.
select_train <- train_data %>%
select(overall_qual,overall_cond,year_built, gr_liv_area, bedroom_abv_gr, kitchen_abv_gr, tot_rms_abv_grd,
fireplaces, garage_area, open_porch_sf, pool_area, mo_sold, yr_sold, sale_price)
head(select_train)
## # A tibble: 6 × 14
## overall_qual overall_cond year_built gr_liv_area bedroom_abv_gr kitchen_abv_gr
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 5 2003 1710 3 1
## 2 6 8 1976 1262 3 1
## 3 7 5 2001 1786 3 1
## 4 7 5 1915 1717 3 1
## 5 8 5 2000 2198 4 1
## 6 5 5 1993 1362 1 1
## # … with 8 more variables: tot_rms_abv_grd <dbl>, fireplaces <dbl>,
## # garage_area <dbl>, open_porch_sf <dbl>, pool_area <dbl>, mo_sold <dbl>,
## # yr_sold <dbl>, sale_price <dbl>
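The 13 predictors above were listed by hand; an alternative (a sketch, not what this analysis did) is to keep every numeric column programmatically. The base-R version below mirrors what dplyr's `select(where(is.numeric))` would do, shown on a small made-up data frame:

```r
# Keep only the numeric columns of a data frame (base-R sketch).
df <- data.frame(id = 1:3,
                 zone = c("RL", "RM", "RL"),   # character column, dropped
                 area = c(8450, 9600, 11250))
num_df <- df[sapply(df, is.numeric)]
names(num_df)  # "id" "area"
```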
summary(select_train$sale_price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
hist(select_train$sale_price, xlab = "sale price", ylab = "counts", xlim = c(30000, 760000), main = "Histogram of sale price")
select_train <- select_train %>%
mutate(sale_price = log(sale_price))
summary(select_train$sale_price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.46 11.78 12.00 12.02 12.27 13.53
hist(select_train$sale_price, xlab = "sale price", ylab = "counts", main = "Histogram of sale price")
typeof(select_train)
## [1] "list"
names(select_train)
## [1] "overall_qual" "overall_cond" "year_built" "gr_liv_area"
## [5] "bedroom_abv_gr" "kitchen_abv_gr" "tot_rms_abv_grd" "fireplaces"
## [9] "garage_area" "open_porch_sf" "pool_area" "mo_sold"
## [13] "yr_sold" "sale_price"
Because we do not yet know how the variables are correlated, we can pick one variable as the predictor and check the fit of the model using its R-squared value. For example, suppose we are interested in the relationship between the sale price and the total number of rooms the house has.
reg_rms <- lm(sale_price ~ tot_rms_abv_grd, data = select_train)
summary(reg_rms)
##
## Call:
## lm(formula = sale_price ~ tot_rms_abv_grd, data = select_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4133 -0.1940 0.0071 0.2029 1.0556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.16802 0.03654 305.62 <2e-16 ***
## tot_rms_abv_grd 0.13134 0.00544 24.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3377 on 1458 degrees of freedom
## Multiple R-squared: 0.2856, Adjusted R-squared: 0.2851
## F-statistic: 582.9 on 1 and 1458 DF, p-value: < 2.2e-16
ggplot(select_train, aes(x = tot_rms_abv_grd, y = sale_price)) +
geom_point() +
stat_smooth(method = "lm", col = "red")
It has an adjusted R-squared value of 0.2851, which means that only 28.51% of the variance in sale price is explained by the total number of rooms the house has. Clearly, we are not going to draw a separate regression graph for sale price against each explanatory variable. Instead, we will draw a correlation graph of all our selected variables and choose the one with the highest correlation coefficient with the outcome variable.
cor_select <- select_train %>%
correlate()
rplot(cor_select)
cor_select %>%
stretch() %>%
ggplot(aes(x, y, fill = r)) +
geom_tile() +
geom_text(aes(label = as.character(fashion(r))))+
scale_x_discrete(guide = guide_axis(angle = 45)) +
NULL
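Instead of reading the strongest predictor off the tile plot by eye, one can rank the absolute correlations directly. A base-R sketch on synthetic data (the real analysis would apply the same idea to `select_train` and `sale_price`):

```r
# Rank predictors by |correlation| with the outcome and pick the strongest.
set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- 2 * df$x1 + rnorm(200)           # y is driven by x1, not x2
cors <- cor(df)[, "y"]
cors <- cors[names(cors) != "y"]          # drop the outcome's self-correlation
best <- names(which.max(abs(cors)))
best  # "x1"
```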
#### Simple Linear Regression
Based on the correlation graph, we can see that the overall quality of a house has the highest correlation with sale price. Thus, the best simple linear regression model will have overall quality as the explanatory variable and sale price as the dependent variable.
reg_qual <- lm(sale_price ~ overall_qual, data = select_train)
summary(reg_qual)
##
## Call:
## lm(formula = sale_price ~ overall_qual, data = select_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06831 -0.12974 0.01309 0.13332 0.92438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.58444 0.02727 388.18 <2e-16 ***
## overall_qual 0.23603 0.00436 54.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2303 on 1458 degrees of freedom
## Multiple R-squared: 0.6678, Adjusted R-squared: 0.6676
## F-statistic: 2931 on 1 and 1458 DF, p-value: < 2.2e-16
The adjusted R-squared value is 0.6676, which means the model explains 66.76% of the variance in (log) sale price. The predictor's coefficient of 0.236 shows that overall quality has a positive association with the sale price. Given the limitations of simple linear regression, this is the best model we can build so far.
ggplot(select_train, aes(x = overall_qual, y = sale_price)) +
geom_point() +
stat_smooth(method = "lm", col = "red")
In this section, we will introduce another model: multiple linear regression. As mentioned in the previous section, adding more predictor variables can help increase the accuracy of our prediction model. First, we will try to use all the predictors.
reg <- lm(sale_price~.-sale_price, data = select_train)
summary(reg)
##
## Call:
## lm(formula = sale_price ~ . - sale_price, data = select_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86658 -0.08263 0.00559 0.09285 0.52817
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.883e+00 6.572e+00 1.504 0.132814
## overall_qual 8.994e-02 5.149e-03 17.467 < 2e-16 ***
## overall_cond 5.579e-02 4.280e-03 13.034 < 2e-16 ***
## year_built 3.923e-03 2.024e-04 19.383 < 2e-16 ***
## gr_liv_area 2.348e-04 1.787e-05 13.142 < 2e-16 ***
## bedroom_abv_gr -8.656e-03 7.573e-03 -1.143 0.253187
## kitchen_abv_gr -7.678e-02 2.184e-02 -3.516 0.000452 ***
## tot_rms_abv_grd 1.441e-02 5.681e-03 2.536 0.011320 *
## fireplaces 7.445e-02 7.746e-03 9.611 < 2e-16 ***
## garage_area 2.999e-04 2.601e-05 11.529 < 2e-16 ***
## open_porch_sf -7.573e-06 6.968e-05 -0.109 0.913472
## pool_area -3.004e-04 1.091e-04 -2.754 0.005958 **
## mo_sold 5.523e-04 1.611e-03 0.343 0.731718
## yr_sold -3.480e-03 3.271e-03 -1.064 0.287515
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1632 on 1446 degrees of freedom
## Multiple R-squared: 0.8345, Adjusted R-squared: 0.833
## F-statistic: 560.9 on 13 and 1446 DF, p-value: < 2.2e-16
As we can see, there is a significant improvement in model fit, but some predictors are not significant in the prediction model. The model assumed that every variable can serve as an explanatory variable; in statistics, however, a predictor with a p-value less than 0.05 is considered statistically significant. So we can eliminate the statistically insignificant predictors without harming the accuracy of the model.
reg <- lm(sale_price~ overall_qual + overall_cond + year_built + gr_liv_area + kitchen_abv_gr + fireplaces + garage_area + pool_area, data = select_train)
summary(reg)
##
## Call:
## lm(formula = sale_price ~ overall_qual + overall_cond + year_built +
## gr_liv_area + kitchen_abv_gr + fireplaces + garage_area +
## pool_area, data = select_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.90258 -0.08228 0.00475 0.09166 0.52893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9584236 0.3981991 7.430 1.85e-13 ***
## overall_qual 0.0910835 0.0050405 18.070 < 2e-16 ***
## overall_cond 0.0555227 0.0042735 12.992 < 2e-16 ***
## year_built 0.0038962 0.0002021 19.282 < 2e-16 ***
## gr_liv_area 0.0002624 0.0000118 22.242 < 2e-16 ***
## kitchen_abv_gr -0.0621671 0.0209142 -2.972 0.00300 **
## fireplaces 0.0746647 0.0076916 9.707 < 2e-16 ***
## garage_area 0.0003016 0.0000258 11.690 < 2e-16 ***
## pool_area -0.0003229 0.0001083 -2.981 0.00292 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1634 on 1451 degrees of freedom
## Multiple R-squared: 0.8336, Adjusted R-squared: 0.8327
## F-statistic: 908.8 on 8 and 1451 DF, p-value: < 2.2e-16
Now we have a smaller group of predictors but the same R-squared value. When we look more closely at the data, we find that only one house in the dataset has a pool, which means more than 99% of the houses have a pool area of 0. With many predictors and a high R-squared value, a model may also run into a problem called overfitting. Next, we will try to reduce the number of predictors and check the difference in model accuracy.
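The overfitting concern can be made concrete: adding even a pure-noise predictor to a regression can never lower R-squared, which is why adjusted R-squared (which penalizes extra terms) is the better guide when comparing models of different sizes. A small sketch on synthetic data:

```r
# Adding a useless predictor: R-squared cannot decrease for a nested model,
# while adjusted R-squared typically does.
set.seed(42)
n <- 100
x <- rnorm(n)
noise <- rnorm(n)                    # unrelated to the outcome
y <- x + rnorm(n)
fit1 <- summary(lm(y ~ x))
fit2 <- summary(lm(y ~ x + noise))
fit2$r.squared >= fit1$r.squared     # TRUE for any nested pair of models
```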
reg1 <- lm(sale_price~ overall_qual + overall_cond + year_built + gr_liv_area + fireplaces + garage_area, data = select_train)
summary(reg1)
##
## Call:
## lm(formula = sale_price ~ overall_qual + overall_cond + year_built +
## gr_liv_area + fireplaces + garage_area, data = select_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.01708 -0.08287 0.00611 0.09265 0.53421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.782e+00 3.956e-01 7.033 3.09e-12 ***
## overall_qual 9.437e-02 4.976e-03 18.964 < 2e-16 ***
## overall_cond 5.703e-02 4.261e-03 13.384 < 2e-16 ***
## year_built 3.948e-03 2.024e-04 19.509 < 2e-16 ***
## gr_liv_area 2.486e-04 1.132e-05 21.956 < 2e-16 ***
## fireplaces 7.752e-02 7.640e-03 10.147 < 2e-16 ***
## garage_area 3.014e-04 2.593e-05 11.624 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1642 on 1453 degrees of freedom
## Multiple R-squared: 0.8317, Adjusted R-squared: 0.831
## F-statistic: 1197 on 6 and 1453 DF, p-value: < 2.2e-16
reg2 <- lm(sale_price~ overall_qual + overall_cond + year_built + gr_liv_area + fireplaces, data = select_train)
summary(reg2)
##
## Call:
## lm(formula = sale_price ~ overall_qual + overall_cond + year_built +
## gr_liv_area + fireplaces, data = select_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.93571 -0.08658 0.00574 0.10183 0.55764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.580e+00 3.990e-01 3.960 7.84e-05 ***
## overall_qual 1.052e-01 5.110e-03 20.582 < 2e-16 ***
## overall_cond 5.710e-02 4.453e-03 12.822 < 2e-16 ***
## year_built 4.570e-03 2.039e-04 22.411 < 2e-16 ***
## gr_liv_area 2.816e-04 1.145e-05 24.583 < 2e-16 ***
## fireplaces 7.852e-02 7.984e-03 9.834 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1716 on 1454 degrees of freedom
## Multiple R-squared: 0.816, Adjusted R-squared: 0.8154
## F-statistic: 1290 on 5 and 1454 DF, p-value: < 2.2e-16
Without the predictors kitchen_abv_gr and pool_area, the model maintains the same level of accuracy. kitchen_abv_gr may be correlated with another predictor, and pool_area is essentially a zero vector. After eliminating these two predictors, the predictor matrix has effectively the same rank, so the new model reaches the same accuracy with fewer predictors.
Now that we have the best model we can achieve with multiple linear regression, we can, out of curiosity, test it with a machine-learning workflow. First, we split the data into training and testing groups. Since the test file that Kaggle provided does not have sale prices, we will use the train file instead. Then, we set up a recipe with our multiple regression model and fit it to the training data. Next, we predict the sale price of our test data and compare it with the actual price.
set.seed(123)
select_split <- initial_split(select_train, prop = 0.80,
strata = sale_price)
train <- training(select_split)
test <- testing(select_split)
house_recipe <- recipe(sale_price~ overall_qual + overall_cond + year_built + gr_liv_area + fireplaces + garage_area, data = train)
lm_model <- linear_reg() %>%
set_engine("lm")
lm_wflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(house_recipe)
lm_fit <- fit(lm_wflow, train)
lm_fit %>%
extract_fit_parsnip() %>%
tidy()
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.52 0.434 5.80 8.48e- 9
## 2 overall_qual 0.0953 0.00545 17.5 5.99e-61
## 3 overall_cond 0.0590 0.00477 12.4 4.78e-33
## 4 year_built 0.00407 0.000222 18.3 3.37e-66
## 5 gr_liv_area 0.000252 0.0000130 19.4 9.06e-73
## 6 fireplaces 0.0742 0.00850 8.73 9.10e-18
## 7 garage_area 0.000306 0.0000285 10.8 8.69e-26
By combining the model prediction values with the actual values, we can generate a plot of predicted values vs. actual values.
test_res <- predict(lm_fit, new_data = test %>% select(-sale_price))
test_res %>%
head()
## # A tibble: 6 × 1
## .pred
## <dbl>
## 1 12.4
## 2 11.7
## 3 11.7
## 4 11.3
## 5 11.9
## 6 11.7
test_res <- bind_cols(test_res, test %>% select(sale_price))
test_res %>%
head()
## # A tibble: 6 × 2
## .pred sale_price
## <dbl> <dbl>
## 1 12.4 12.2
## 2 11.7 11.8
## 3 11.7 11.9
## 4 11.3 11.1
## 5 11.9 11.9
## 6 11.7 11.6
test_res %>%
ggplot(aes(x = .pred, y = sale_price)) +
geom_point(alpha = 0.2) +
geom_abline(lty = 2) +
theme_bw() +
coord_obs_pred()
Since the dots spread along the diagonal line, the model did a pretty good job of predicting the sale price. We can also check the RMSE value and R-squared value to evaluate the model's accuracy.
predict_metrics <- metric_set(rmse, rsq, mae)
predict_metrics(test_res, truth = sale_price,
estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.168
## 2 rsq standard 0.827
## 3 mae standard 0.118
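Since the model was fit on log(sale_price), this RMSE of 0.168 is on the log scale. Exponentiating gives a rough multiplicative interpretation (a sketch, using the value from the output above):

```r
# An RMSE of 0.168 on log(price) means a typical prediction is off by a
# factor of about exp(0.168), i.e. roughly 18% of the sale price.
rmse_log <- 0.168
exp(rmse_log)  # about 1.18
```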
From our prediction model and explanatory variables, we find that only one house in the data set has a pool.
summary(train$pool_area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 2.54 0.00 738.00
The number of fireplaces plays an important role in the prediction model, so we can guess that the data were collected from a cold place. In the next section, we will add data about heating and other house conditions to another model.
In this section, we will explore more explanatory variables. Since we have not yet considered the non-numerical data, we will use additional methods, such as tree-based models, to predict the housing prices. As we said in the previous section, these data may come from a cold place, so we want to evaluate the heating quality and air conditioning of each house: in cold places, people spend a lot of time indoors, and these factors are definitely what they care about when buying a house.
# Since each of these factors rates the condition of the exterior material or
# the heating quality, we can convert the factors to numeric scores. Listing
# the levels from worst to best makes the level codes 1-5 follow the quality scale.
train_data$exter_cond <- as.numeric(factor(train_data$exter_cond,
                                           levels = c("Po", "Fa", "TA", "Gd", "Ex"),
                                           ordered = TRUE))
train_data$heating_qc <- as.numeric(factor(train_data$heating_qc,
                                           levels = c("Po", "Fa", "TA", "Gd", "Ex"),
                                           ordered = TRUE))
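One pitfall worth noting here: calling `as.numeric()` directly on a factor returns the level *index*, not the label, so if the labels are not listed in the same rank order as the levels, the quality scale gets scrambled. A small demonstration:

```r
# as.numeric(factor) returns level positions (codes), not labels.
x <- c("Ex", "Po")
f <- factor(x, levels = c("Ex", "Fa", "Gd", "TA", "Po"),
            labels = c(5, 2, 4, 3, 1))
as.numeric(f)                # 1 5 -> level positions; the scale is inverted
as.numeric(as.character(f))  # 5 1 -> the intended numeric scores
```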
Now, we can create a new select dataset and add in our new predictors.
new_select_data <- train_data %>%
select(overall_qual, overall_cond, year_built, gr_liv_area, fireplaces, garage_area, central_air, heating_qc, exter_cond, sale_price)
head(new_select_data)
## # A tibble: 6 × 10
## overall_qual overall_cond year_built gr_liv_area fireplaces garage_area
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 5 2003 1710 0 548
## 2 6 8 1976 1262 1 460
## 3 7 5 2001 1786 1 608
## 4 7 5 1915 1717 1 642
## 5 8 5 2000 2198 1 836
## 6 5 5 1993 1362 0 480
## # … with 4 more variables: central_air <chr>, heating_qc <dbl>,
## # exter_cond <dbl>, sale_price <dbl>
Similar to the first regression part, we will add v-fold cross-validation.
set.seed(131)
new_split <- initial_split(new_select_data, prop =0.8, strata = "sale_price")
new_train <- training(new_split)
new_test <- testing(new_split)
new_fold <- vfold_cv(new_train, v = 5, strata = "sale_price")
Here is a new recipe that we will use with the tree models. Note that sale_price is continuous, so the trees below are regression trees; we first grow one directly with rpart and inspect its complexity table.
price_recipe <- recipe(sale_price~ overall_qual + overall_cond + year_built + gr_liv_area + fireplaces + garage_area + central_air + heating_qc + exter_cond, data = new_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_predictors())
class.tree <- rpart(sale_price~ overall_qual + overall_cond + year_built + gr_liv_area + fireplaces + garage_area + central_air + heating_qc + exter_cond,
data = new_train, control = rpart.control(cp = 0.01))
plotcp(class.tree)
printcp(class.tree)
##
## Regression tree:
## rpart(formula = sale_price ~ overall_qual + overall_cond + year_built +
## gr_liv_area + fireplaces + garage_area + central_air + heating_qc +
## exter_cond, data = new_train, control = rpart.control(cp = 0.01))
##
## Variables actually used in tree construction:
## [1] gr_liv_area overall_qual year_built
##
## Root node error: 7.574e+12/1166 = 6495687471
##
## n= 1166
##
## CP nsplit rel error xerror xstd
## 1 0.447474 0 1.00000 1.00102 0.081591
## 2 0.123676 1 0.55253 0.55574 0.042324
## 3 0.063138 2 0.42885 0.43228 0.040314
## 4 0.037239 3 0.36571 0.36870 0.028696
## 5 0.022119 4 0.32847 0.33228 0.028287
## 6 0.020687 5 0.30635 0.32322 0.028286
## 7 0.017157 6 0.28567 0.30965 0.027618
## 8 0.013418 7 0.26851 0.29745 0.027263
## 9 0.010000 8 0.25509 0.27643 0.024881
rpart.plot(class.tree)
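The cp table above is what one would use to prune the tree. A common recipe (a sketch, assuming the rpart package; synthetic data rather than the housing set) is to cut back at the complexity value with the smallest cross-validated error (`xerror`):

```r
library(rpart)

# Grow a deliberately deep regression tree on synthetic data, then prune at
# the cp with the lowest cross-validated error.
set.seed(1)
d <- data.frame(x = runif(300))
d$y <- ifelse(d$x > 0.5, 10, 0) + rnorm(300)
fit <- rpart(y ~ x, data = d, control = rpart.control(cp = 0.0005))
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)
nrow(pruned$cptable) <= nrow(fit$cptable)  # pruning never adds splits
```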
Beyond fitting the tree directly with rpart, we can also build the regression tree through tidymodels, reusing the predictors from the previous section.
tree_spec <- decision_tree() %>%
set_engine("rpart")
reg_tree_spec <- tree_spec %>%
set_mode("regression")
reg_tree_fit <- fit(reg_tree_spec,
sale_price~ overall_qual + overall_cond + year_built + gr_liv_area + fireplaces + garage_area + central_air + heating_qc + exter_cond, new_train)
augment(reg_tree_fit, new_data = new_test) %>%
rmse(truth = sale_price, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 38992.
reg_tree_fit %>%
extract_fit_engine() %>%
rpart.plot()
reg_tree_wf <- workflow() %>%
add_model(reg_tree_spec %>% set_args(cost_complexity = tune())) %>%
add_formula(sale_price~ overall_qual + overall_cond + year_built + gr_liv_area + fireplaces)
set.seed(3435)
reg_fold <- vfold_cv(new_train)
param_grid <- grid_regular(cost_complexity(range = c(-4, -1)), levels = 5)
tune_res <- tune_grid(
reg_tree_wf,
resamples = reg_fold,
grid = param_grid
)
autoplot(tune_res)
We select the best-performing model according to rmse and fit the final model on the whole training data set.
best_complexity <- select_best(tune_res, metric = "rmse")
reg_tree_final <- finalize_workflow(reg_tree_wf, best_complexity)
reg_tree_final_fit <- fit(reg_tree_final, data = new_train)
Now we can see a more complex decision tree than the previous one.
reg_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot()
In this section, we will explore a bagged-tree model with our new predictors and test its accuracy. Bagging is a random forest in which mtry equals the number of predictors, so every tree may split on any variable.
bagging_spec <- rand_forest(mtry = .cols()) %>%
set_engine("randomForest", importance = TRUE) %>%
set_mode("regression")
bagging_fit <- fit(bagging_spec, sale_price~ overall_qual + overall_cond + year_built + gr_liv_area + fireplaces + garage_area + heating_qc + exter_cond,
data = new_train)
augment(bagging_fit, new_data = new_test) %>%
rmse(truth = sale_price, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 26779.
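The idea behind bagging (averaging models fit to bootstrap resamples to reduce variance) can be sketched in base R without the random-forest machinery. By convexity, the averaged predictor's squared error is never worse than the average of the individual models' squared errors:

```r
# Miniature bagging: average predictions from models fit on bootstrap resamples.
set.seed(1)
n <- 100
d <- data.frame(x = seq(0, 1, length.out = n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.3)

B <- 50
preds <- sapply(seq_len(B), function(b) {
  idx <- sample(n, replace = TRUE)             # bootstrap resample
  fit <- lm(y ~ poly(x, 8), data = d[idx, ])   # a flexible base learner
  predict(fit, newdata = d)
})
bagged <- rowMeans(preds)                      # the bagged prediction

# Averaging can only help (Jensen's inequality applied pointwise):
mean((bagged - d$y)^2) <= mean((preds - d$y)^2)  # TRUE
```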
Let's check our predictions against the actual test values.
augment(bagging_fit, new_data = new_test) %>%
ggplot(aes(sale_price, .pred)) +
geom_abline() +
geom_point(alpha = 0.5)
We can see that our model did a pretty good job, but there are also some outliers at the upper end.
Then let’s check the importance of the variables and see if it fits our previous prediction.
vip(bagging_fit)
As we can see, the above-ground living area plays a big role in the prediction. However, extra factors such as fireplaces and heating condition are also important in this data set.
From all the models we fit and the data we trained on, we can say that people care most about the above-ground living area and the garage area of a house. Compared to the model we built in lab7, these data also have important predictors such as fireplaces and heating condition. Thus we may conclude that the housing data in this Kaggle competition likely come from a cold region of the U.S. or Europe, where the population is sparse and there is more land per person. In Asian countries, there is almost no tradition of fireplaces in the house, and cities adopt a different housing type.